

Search for: All records

Creators/Authors contains: "Muhammad Anas Raza, Khalid Mahmood"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Deepfakes, or synthetic audiovisual media created with the intent to deceive, are growing increasingly prevalent. Existing methods, applied independently to images/patches or jointly to tubelets, have to date typically focused on spatial or spatiotemporal inconsistencies. However, the evolving nature of deepfakes demands a holistic approach. Inspecting a given multimedia sample to validate its authenticity without adding significant computational overhead has not yet been fully explored in the literature, and no prior work has examined the impact of different inconsistency dimensions within a single framework. This paper tackles the deepfake detection problem holistically. HolisticDFD, a novel transformer-based deepfake detection method that is both lightweight and compact, intelligently combines embeddings from the spatial, temporal, and spatiotemporal dimensions to separate deepfakes from bona fide videos. The proposed system achieves 0.926 AUC on the DFDC dataset using just 3% of the parameters used by state-of-the-art detectors. An evaluation on other datasets shows the efficacy of the proposed framework, and an ablation study shows that performance gradually improves as embeddings with different data representations are combined. An implementation of the proposed model is available at: https://github.com/smileslab/deepfake-detection/.
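The core idea of the abstract above, combining embeddings from several inconsistency dimensions before classification, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the embedding sizes, the function name `fuse_embeddings`, and the use of plain concatenation are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding widths for each branch (illustrative values,
# not taken from the paper).
D_SPATIAL, D_TEMPORAL, D_SPATIOTEMPORAL = 64, 64, 64

def fuse_embeddings(e_spatial, e_temporal, e_spatiotemporal):
    """Concatenate per-dimension embeddings into one holistic vector,
    mirroring the idea of combining spatial, temporal, and
    spatiotemporal representations before a final classifier."""
    return np.concatenate([e_spatial, e_temporal, e_spatiotemporal], axis=-1)

# Toy example: one video's embedding from each branch.
e_s = rng.standard_normal(D_SPATIAL)
e_t = rng.standard_normal(D_TEMPORAL)
e_st = rng.standard_normal(D_SPATIOTEMPORAL)

fused = fuse_embeddings(e_s, e_t, e_st)
print(fused.shape)  # (192,)
```

In the actual system the fused representation would feed a transformer-based classifier; here the fusion step alone is shown to make the "holistic" combination concrete.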
  2. Deepfakes, created with generative deep learning techniques, are used to sow mistrust in society, manipulate public opinion and political decisions, and carry out other malicious activities such as blackmail, scamming, and even cyberstalking. Since a realistic deepfake may involve manipulation of audio, video, or both, it is important to explore detecting deepfakes through the inability of generative algorithms to synchronize the audio and visual modalities. Prevailing performant methods detect either audio or video cues, while a few ensemble the predictions from both modalities without inspecting the relationship between audio and video cues. Deepfake detection using joint audiovisual representation learning remains largely unexplored. This paper therefore proposes a unified multimodal framework, Multimodaltrace, which extracts learned channels from the audio and visual modalities, mixes them independently in an IntrAmodality Mixer Layer (IAML), processes them jointly in IntErModality Mixer Layers (IEML), and feeds the result to a multilabel classification head. Empirical results show the effectiveness of the proposed framework, which achieves state-of-the-art accuracy of 92.9% on the FakeAVCeleb dataset. Cross-dataset evaluation on the World Leaders and Presidential Deepfake Detection datasets gives accuracies of 83.61% and 70%, respectively. The study also provides insights into how the model focuses on different parts of the audio and visual features through integrated gradient analysis.
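The intra-then-inter modality mixing described above can be sketched in miniature. This is a hedged toy sketch, not the Multimodaltrace code: the channel width, the simple two-layer MLPs, the residual connection, and the four-way multilabel output are all assumptions for illustration; the key point is that each modality is first mixed independently (IAML-style) and the two are then processed jointly (IEML-style).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # hypothetical channel dimension per modality

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU, applied to channel vectors.
    return np.maximum(x @ w1, 0) @ w2

def intra_mix(x, w1, w2):
    # IntrAmodality mixing: one modality processed on its own,
    # with a residual connection.
    return x + mlp(x, w1, w2)

def inter_mix(audio, visual, w1, w2):
    # IntErModality mixing: audio and visual channels concatenated
    # and processed jointly before the multilabel head.
    joint = np.concatenate([audio, visual], axis=-1)
    return mlp(joint, w1, w2)

# Toy inputs: one sample's audio and visual channel vectors.
audio = rng.standard_normal((1, D))
visual = rng.standard_normal((1, D))

# Randomly initialized weights for the sketch.
wa1, wa2 = rng.standard_normal((D, D)), rng.standard_normal((D, D))
wv1, wv2 = rng.standard_normal((D, D)), rng.standard_normal((D, D))
wj1, wj2 = rng.standard_normal((2 * D, D)), rng.standard_normal((D, 4))

a = intra_mix(audio, wa1, wa2)
v = intra_mix(visual, wv1, wv2)
logits = inter_mix(a, v, wj1, wj2)  # four hypothetical multilabel logits
print(logits.shape)  # (1, 4)
```

The separation into `intra_mix` and `inter_mix` mirrors the paper's point that the relationship between audio and video cues must be modeled jointly rather than merely ensembled after independent predictions.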